Ser. No. 07/933,865, filed Aug. 21, 1993, and now
abandoned, a continuation of U.S. patent application Ser.
No. 435^91 filed Nov. 17, 1989 and now abandoned; 

U.S. Pat. No. 5,212,777, issued May 18, 1993, filed Nov.
17, 1989 and entitled "SIMD/MIMD RECONFIGURABLE
MULTI-PROCESSOR AND METHOD OF OPERATION";
U.S. patent application Ser. No. 08/264,111 filed Jun. 22,
TIONS FOR MULTI-PROCESSOR AND METHOD OF
OPERATION", a continuation of U.S. patent application
Ser. No. 07/895^65, filed Jan. 5, 1992, and now abandoned,
a continuation of U.S. patent application Ser. No. 07/437,
856, filed Nov. 17, 1989 and now abandoned;

U.S. patent application Ser. No. 08/264,582, filed Jun. 22,
1994, entitled "REDUCED AREA OF CROSSBAR AND
METHOD OF OPERATION", a continuation of U.S. patent
application Ser. No. 07/437,852, filed Nov. 17, 1989, and
now abandoned; 

U.S. patent application Ser. No. 08/032^30 filed Mar. 15,
1993 entitled "SYNCHRONIZED MIMD MULTI-
PROCESSING SYSTEM AND METHOD OF
OPERATION," a continuation of U.S. patent application
Ser. No. 07/437,353 filed Nov. 17, 1989 and now abandoned;

DITIONAL SUBTRACTION FOR CONVERSION OF
NEGATIVE NUMBERS";

U.S. patent application Ser. No. 08/160,112, "METHOD,
APPARATUS AND SYSTEM FOR SUM OF PLURAL
ABSOLUTE DIFFERENCES", and now pending;

U.S. patent application Ser. No. 08/160,120, "ITERA-
TIVE DIVISION APPARATUS, SYSTEM AND METHOD
EMPLOYING LEFT MOST ONE'S DETECTION AND
LEFT MOST ONE'S DETECTION WITH EXCLUSIVE
OR", and now pending;

U.S. patent application Ser. No. 08/160,114, "ADDRESS
GENERATOR EMPLOYING SELECTIVE MERGE OF
TWO INDEPENDENT ADDRESSES", and now pending;

U.S. Pat. No. 5,420,809, "METHOD, APPARATUS AND
SYSTEM METHOD FOR CORRELATION";

U.S. Pat. No. 5,197,140 issued Mar. 23, 1993 filed Nov.
17, 1989 and entitled "SLICED ADDRESSING MULTI-
PROCESSOR AND METHOD OF OPERATION";

U.S. Pat. No. 5,339,447, issued Aug. 16, 1994, filed Nov.
17, 1989 entitled "ONES COUNTING CIRCUIT, UTILIZ-
ING A MATRIX OF INTERCONNECTED HALF-
ADDERS, FOR COUNTING THE NUMBER OF ONES IN
A BINARY STRING OF IMAGE DATA";

U.S. Pat. No. 5,239,654 issued Aug. 24, 1993 filed Nov.
17, 1989 and entitled "DUAL MODE SIMD/MIMD PRO-
CESSOR PROVIDING REUSE OF MIMD INSTRUC-
TION MEMORIES AS DATA MEMORIES WHEN OPER-
ATING IN SIMD MODE";

U.S. Pat. No. 5,410,649, filed Jun. 29, 1992 entitled
"IMAGING COMPUTER AND METHOD OF
OPERATION", a continuation of U.S. patent application
Ser. No. 07/437,854, filed Nov. 17, 1989 and now aban-
doned; and

U.S. Pat. No. 5,226,125 issued Jul. 6, 1993 filed Nov. 17,
1989 and entitled "SWITCH MATRIX HAVING INTE-
GRATED CROSSPOINT LOGIC AND METHOD OF
OPERATION".
This application is also related to the following concurrently
filed U.S. patent applications, which include the same
disclosure:

U.S. Pat. No. 5,509,129, "LONG INSTRUCTION
WORD CONTROLLING PLURAL INDEPENDENT
PROCESSOR OPERATIONS";

U.S. patent application Ser. No. 08/159,346, "ROTATION
REGISTER FOR ORTHOGONAL DATA TRANSFOR-
MATION", and now pending;

U.S. patent application Ser. No. 08/159,652, "MEDIAN
FILTER METHOD, CIRCUIT AND SYSTEM", and now
pending;

U.S. patent application Ser. No. 08/159,344, "ARITH-
METIC LOGIC UNIT WITH CONDITIONAL REGISTER
SOURCE SELECTION", and now pending;

U.S. patent application Ser. No. 08/160,301,
"APPARATUS, SYSTEM AND METHOD FOR DIVI-
SION BY ITERATION", and now pending;

U.S. patent application Ser. No. 08/159,650, "MULTIPLY
ROUNDING USING REDUNDANT CODED MULTIPLY
RESULT", and now pending;

U.S. Pat. No. 5,446,651, "SPLIT MULTIPLY OPERA-
TION";

U.S. patent application Ser. No. 08/432,697, filed Jun. 7,
1995, "MIXED CONDITION TEST CONDITIONAL AND
BRANCH OPERATIONS INCLUDING CONDITIONAL
TEST FOR ZERO", a continuation of U.S. patent application
Ser. No. 08/158,741, concurrently filed with this application
and now abandoned;
U.S. patent application Ser. No. 08/160,302, "PACKED
WORD PAIR MULTIPLY OPERATION", and now abandoned;

U.S. patent application Ser. No. 08/160,573, "THREE
INPUT ARITHMETIC LOGIC UNIT WITH SHIFTER",
and now pending;

U.S. patent application Ser. No. 08/159,282, "THREE
INPUT ARITHMETIC LOGIC UNIT WITH MASK
GENERATOR", and now pending;

U.S. patent application Ser. No. 08/160,111, "THREE
INPUT ARITHMETIC LOGIC UNIT WITH BARREL
ROTATOR AND MASK GENERATOR", and now pending;

U.S. patent application Ser. No. 08/160,298, "THREE
INPUT ARITHMETIC LOGIC UNIT WITH SHIFTER
AND MASK GENERATOR", and now pending;

U.S. Pat. No. 5,485,411, "THREE INPUT ARITHMETIC
LOGIC UNIT FORMING THE SUM OF A FIRST INPUT
ADDED WITH A FIRST BOOLEAN COMBINATION OF
A SECOND INPUT AND THIRD INPUT PLUS A SECOND
BOOLEAN COMBINATION OF THE SECOND
AND THIRD INPUTS";

U.S. Pat. No. 5,465,224, "THREE INPUT ARITHMETIC
LOGIC UNIT FORMING THE SUM OF FIRST BOOLEAN
COMBINATION OF FIRST, SECOND AND THIRD
INPUTS PLUS A SECOND BOOLEAN COMBINATION
OF FIRST, SECOND AND THIRD INPUTS";

U.S. Pat. No. 5,493,524, "THREE INPUT ARITHMETIC
LOGIC UNIT EMPLOYING CARRY PROPAGATE
LOGIC", a continuation of U.S. patent application Ser. No.
08/159,640, filed concurrently with this application and now
abandoned; and

U.S. patent application Ser. No. 08/160,300, "DATA
PROCESSING APPARATUS, SYSTEM AND METHOD
FOR IF, THEN, ELSE OPERATION USING WRITE
PRIORITY", and now pending.

TECHNICAL FIELD OF THE INVENTION

The technical field of this invention is the field of digital
data processing and more particularly microprocessor
circuits, architectures and methods for digital data processing
especially digital image/graphics processing.

BACKGROUND OF THE INVENTION

This invention relates to the field of computer graphics
and in particular to bit mapped graphics. In bit mapped
graphics computer memory stores data for each individual
picture element or pixel of an image at memory locations
that correspond to the location of that pixel within the image.
This image may be an image to be displayed or a captured
image to be manipulated, stored, displayed or retransmitted.
The field of bit mapped computer graphics has benefited
greatly from the lowered cost and increased capacity of
dynamic random access memory (DRAM) and the lowered
cost and increased processing power of microprocessors.
These advantageous changes in the cost and performance of
component parts enable larger and more complex computer
image systems to be economically feasible.

The field of bit mapped graphics has undergone several
stages of evolution of the types of data manipulation.
Initially a computer system supporting
bit mapped graphics employed the system processor for all
bit mapped operations. This type of system suffered several
drawbacks. First, the computer system processor was not
particularly designed for handling bit mapped graphics.
Design choices that are very reasonable for general purpose
computing are unsuitable for bit mapped graphics systems.
Consequently some routine graphics tasks operated slowly.
In addition, it was quickly discovered that the processing
needed for image manipulation of bit mapped graphics was
so loading the computational capacity of the system processor
that other operations were also slowed.

The next step in the evolution of bit mapped graphics
processing was dedicated hardware graphics controllers.
These devices can draw simple figures, such as lines,
ellipses and circles, under the control of the system processor.
Many of these devices can also do pixel block transfers
(PixBlt). A pixel block transfer is a memory move operation
of image data from one portion of memory to another. A
pixel block transfer is useful for rendering standard image
elements, such as alphanumeric characters in a particular
type font, within a display by transfer from nondisplayed
memory to bit mapped display memory. This function can
also be used for tiling by transferring the same small image
to the whole of bit mapped display memory. The built-in
algorithms for performing some of the most frequently used
graphics functions provide a way of improving system
performance. However, a useful graphics computer system
often requires many functions besides those few that are
implemented in such a hardware graphics controller. These
additional functions must be implemented in software by the
system processor. Typically these hardware graphics controllers
allow the system processor only limited access to the
bit map memory, thereby limiting the degree to which
system software can augment the fixed set of functions of the
hardware graphics controller.

The graphics system processor represents yet a further
step in the evolution of bit mapped graphics processing. A
graphics system processor is a programmable device that has
all the attributes of a microprocessor and also includes
special functions for bit mapped graphics. The TMS34010
and TMS34020 graphics system processors manufactured
by Texas Instruments Incorporated represent this class of
devices. These graphics system processors respond to a
stored program in the same manner as a microprocessor and
include the capability of data manipulation via an arithmetic
logic unit, data storage in register files and control of both
program flow and external data memory. In addition, these
devices include special purpose graphics manipulation hardware
that operates under program control. Additional instructions
within the instruction set of these graphics system
processors control the special purpose graphics hardware.
These instructions and the hardware that supports them are
selected to perform base level graphics functions that are
useful in many contexts. Thus a graphics system processor
can be programmed for many differing graphics applications
using algorithms selected for the particular problem. This
provides an increase in usefulness similar to that provided
by changing from hardware controllers to programmed
microprocessors. Because such graphics system processors
are programmable devices in the same manner as
microprocessors, they can operate as stand alone graphics
processors, graphics co-processors slaved to a system processor
or tightly coupled graphics controllers.

New applications are driving the desire to provide more
powerful graphics functions. Several fields require more
cost effective graphics operations to be economically feasible.
These include video conferencing, multi-media computing
with full motion video, high definition television,
color facsimile and digital photography. Each of these fields
presents unique problems, but image data compression and
decompression are common themes. The amount of transmission
bandwidth and the amount of storage capacity
ABSTRACT



A compressed instruction format for a VLIW processor 
allows greater efficiency in use of cache and memory. 
Instructions are byte aligned and variable length. Branch 
targets are uncompressed. Format bits specify how many 
issue slots are used in a following instruction. NOPS are not 
stored in memory. Individual operations are compressed 
according to features such as whether they are resultless, 
guarded, short, zeroary, unary, or binary. Instructions are 
stored in compressed form in memory and in cache. Instruc- 
tions are decompressed on the fly after being read out from 
cache. 
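
As a rough illustration of the idea that NOPs need not be stored, the following Python sketch keeps only the non-NOP operations of one instruction together with a small presence mask. The five-slot width, the `NOP` sentinel and the mask layout are assumptions made for this example. The abstract states only that format bits indicate how many issue slots the following instruction uses, so this is a simplified stand-in and not the encoding described by the patent.

```python
NOP = None      # hypothetical sentinel for an empty issue slot
NUM_SLOTS = 5   # assumed issue width, not taken from the source


def compress(slots):
    """Return (mask, ops): bit i of mask is set when slot i holds a real operation."""
    if len(slots) != NUM_SLOTS:
        raise ValueError("expected exactly NUM_SLOTS slots")
    mask = 0
    ops = []
    for i, op in enumerate(slots):
        if op is not NOP:
            mask |= 1 << i
            ops.append(op)   # only real operations are stored
    return mask, ops


def decompress(mask, ops):
    """Rebuild the full-width instruction, re-inserting NOPs for cleared mask bits."""
    remaining = iter(ops)
    return [next(remaining) if mask & (1 << i) else NOP
            for i in range(NUM_SLOTS)]
```

In this sketch an instruction with two real operations stores two operations plus the mask, and the decompressor restores the empty slots when the instruction is read back.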
ABSTRACT



A data processing system includes branch prediction appa- 
ratus for storing branch data in a branch prediction RAM 
after each branch has occurred. The RAM interfaces with 
branch logic means which tracks whether a branch is in 
progress and if a branch was guessed. An operational code 
compression means forms each instruction into a new opera- 
tion code of lesser bits and embeds a guess bit into the new 
operational code. Control means decode the compressed 
operational code as an input to an instruction execution unit 
whereby conditional branch occurs based on the guess bit 
provided a branch instruction is not in progress in the 
system. 

9 Claims, 3 Drawing Sheets 
DATA PROCESSING SYSTEM HAVING
PREDICTION BY USING AN EMBEDDED 
GUESS BIT OF REMAPPED AND 
COMPRESSED OPCODES 

This invention was made with Government support 
under contract number F29601-87-C-0006, awarded by the 
Department of the Air Force. The Government has certain 
rights in this invention. 

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention disclosed broadly relates to digital computer
processing systems and more particularly to pipelined
data processing systems including branch prediction. 

2. Background Art 

Data processing systems generally include a central processor,
associated storage systems and peripheral devices
and interfaces. Typically the main memory consists of
relatively low cost, high capacity, digital storage devices.
The peripheral devices may be, for example, nonvolatile,
semi-permanent storage media such as magnetic disks and
magnetic tape drives. In order to carry out tasks, the central
processor of such a system executes a succession of instructions
which operate on the data. The succession of instructions
and the data those instructions reference are referred to
as a program.

In the operation of such systems, programs are initially
brought to an intermediate storage area, usually in the main
memory. The central processor may then interface directly to
the main memory to execute the stored program. However,
this procedure places limitations on performance due principally
to the relatively long times required in accessing that
main memory. To overcome these limitations, a high speed
storage system, in some cases called a cache, is used to hold
currently used portions of the program within the central
processor itself. The cache interfaces with the main memory
through memory control hardware which handles program
transfers between the central processor, main memory and
the peripheral device interfaces.

One form of computer has been developed in the prior art
to concurrently process a succession of instructions in a
so-called pipeline manner. In such pipeline processors, each
instruction is executed in part at each of a succession of
stages. After the instruction has been processed at each of
the stages, the execution is complete. With this configuration,
an instruction is passed from one stage to the next. That
instruction is replaced by the next instruction in the program.
Thus, the stages together form a pipeline which, at any given
time, is executing in part a succession of instructions. Such
instruction pipelines, processing a plurality of instructions in
parallel, are found in several digital computing systems.
These processors consist of a single pipeline of varying
length and employ hardwired logic for all data manipulation.
The large quantity of control logic needed in such machines
to handle, for example, conditional branch instructions
makes them extremely fast, but also very expensive.

The present invention relates to branch prediction mechanisms
for handling conditional branch instructions in a
computer system. When a branch instruction is encountered,
it is wasteful of the computer resource to wait for resolution
of the instruction before proceeding with the next programming
step. Therefore, it is a known advantage to provide a
prediction mechanism to predict in advance the instruction
to be taken as a result of a conditional branch. If the
prediction is successful, it allows a computer system to
function without a delay in processing time. There is a time
penalty if the prediction is incorrect. Therefore an object of
the present invention is to provide an improved branch
prediction mechanism with a high prediction accuracy to
minimize the time loss caused by incorrect predictions.

In most pipeline processors, conditional branch instruc- 
tions are resolved in the execution unit. Hence, there are 
several cycles of delay between the decoding of a condi- 
tional branch instruction and its execution. In an attempt to 
overcome the potential loss of these cycles, the decoder 
guesses as to which instructions to decode next. Many 
pipeline processors classify branches according to an 
instruction field. When a branch is decoded, the outcome of 
the branch is predicted, based on its class. 

An example of a prior art branch prediction scheme is 
disclosed in U.S. Pat. No. 4,477,872 to Losq, et al. which 
patent is assigned to the assignee of the present invention. 
The method disclosed predicts the outcome of a conditional 
branch instruction based on the previous performance of the 
branch, rather than on the instruction fields. The prediction 
of the outcome of a conditional branch is performed utilizing 
a table which records a history of the outcome of the branch 
at a given memory location. The disclosed method predicts 
only the branch outcomes and not the address targets for 
prefetching an instruction. The present invention is related to
patent application Ser. No. 07/783,060 entitled "Synchronizing
a Prediction RAM," assigned to the assignee of the
present invention, filed Oct. 25, 1991, the teachings of which
are herein incorporated by reference. Disclosed is a high speed,
pipelined CPU which breaks large execution flows into
stages to allow a dramatic improvement in the system
latency between registers. The multitude of stages allows
better observability for testing and debugging of the overall
system.

The performance enhancement of the pipeline processor 
is dependent on the degree to which each stage of the 
pipeline is kept busy processing its instructions and passing 
the results onto the next stage. In an ideal environment, each 
instruction would pass through a new stage every clock 
cycle. With this assumption, instruction execution time 
would be equal to the clock cycle time after the start-up 
latency has filled the pipeline. A serious degradation of
pipeline performance can result when branch
instructions cause the pipeline to be flushed and restarted
with a new instruction stream. It is desirable to know the
result of a conditional branch instruction when instructions
are being fetched. Unfortunately, this is not always possible,
because conditional branches are often dependent on the 
instruction immediately preceding them in the pipeline. 
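
The cost of flushing can be made concrete with a small calculation. The Python sketch below assumes a simple model that is not stated in the source: an ideal pipeline completes one instruction per clock cycle, and every mispredicted branch adds a fixed flush penalty. The function name and the sample figures used with it are illustrative assumptions only.

```python
def average_cycles_per_instruction(branch_fraction, prediction_accuracy,
                                   flush_penalty_cycles):
    """Estimate average cycles per instruction under a simple flush model.

    Assumed model: 1 cycle per instruction in the ideal case, plus
    flush_penalty_cycles for each branch that is predicted incorrectly.
    """
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    if not 0.0 <= prediction_accuracy <= 1.0:
        raise ValueError("prediction_accuracy must be between 0 and 1")
    if flush_penalty_cycles < 0:
        raise ValueError("flush_penalty_cycles must be non-negative")
    mispredict_rate = 1.0 - prediction_accuracy
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty_cycles
```

Under this model a perfectly predicted instruction stream stays at one cycle per instruction, and the penalty term grows in proportion to both the share of branches and the share of wrong guesses.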

OBJECTS OF THE INVENTION 

It is therefore an object to provide a highly accurate 
branch prediction. 

It is another object of the invention to provide for instruc- 
tion operation compression within the computer processing 
unit. 

SUMMARY OF THE INVENTION 

The present invention employs the least significant eight
bits of the memory address to address a RAM.
Assuming repeatability in programming, a decision has been
made to guess that the branch will resolve in the same way
the previous branch to a given address was decided. This is
done by using the memory address to read a RAM which
was written with branch data after the branch has been
resolved. Rather than the entire memory address, only the
lower eight bits are used. This provides a good trade-off
between hardware, which increases dramatically with the number
of bits used to address the prediction RAM, and performance
of the device.
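
The scheme just described can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the disclosed circuit: the class and method names are hypothetical, each entry is taken to be a single taken/not-taken bit, and the table is assumed to start out predicting "not taken".

```python
class PredictionRAM:
    """256 one-bit entries indexed by the low eight bits of a memory address."""

    INDEX_BITS = 8  # from the text: only the lower eight address bits are used

    def __init__(self):
        # Assumption: every entry initially predicts "not taken".
        self.bits = [0] * (1 << self.INDEX_BITS)

    def _index(self, address):
        # Keep only the least significant eight bits of the address.
        return address & ((1 << self.INDEX_BITS) - 1)

    def guess(self, address):
        """Predict that the branch resolves the way it did last time."""
        return bool(self.bits[self._index(address)])

    def record(self, address, taken):
        """Write the outcome after the branch has been resolved."""
        self.bits[self._index(address)] = 1 if taken else 0
```

Because only eight bits index the table, two addresses that share the same low byte share one entry. That aliasing is the price of the small RAM in the hardware versus accuracy trade-off described above.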

Along with branch prediction, the operations instruction
code has been compressed from a 12-bit to an eight-bit
mapping to allow 160 operations to be derived from 62
operational codes. This reduces the needed ROM space from a
512-byte ROM to a 256-byte ROM, which represents a
significant saving in hardware size and speed.
EMBODIMENT
These and other objects, features and advantages of the
present invention will be more fully appreciated with refer- 
ence to the accompanying figures. 

FIG. 1 is a schematic diagram of a typical computer
system employing central processing units tied to communication
buses.

FIG. 2 is a logic diagram showing the implementation of 
the present invention. 

FIG. 3 is a block diagram of the branch prediction RAM 
logic. 

FIG. 4 is a table showing the operations code compression 
scheme of the present invention.


An example of a typical computer system embodying the
present invention is shown in FIG. 1. Address processor 12
reads instructions from the main memory 10 and dispatches
commands to execution elements such as fixed point processor
18 and floating point processor 16, or the address
translator 14. The address processor 12 sources the instruction
bus (I-bus) 13 which issues service requests to the
execution elements. Any general purpose register updating
is done across the put-away bus 15.

Assuming repeatability in programming, it was decided to 
implement the best guess that the conditional branch would 
be resolved the same way that the previous conditional 
branch to a given address was decided. This is done by using 
the memory address to read a RAM that is written with 
branch data after the branch has been resolved. Rather than 
the entire memory address, only the lower eight bits are 
used. This provides a good trade-off between hardware, 
which increases dramatically with the number of bits used to
address the prediction RAM, and performance. 

Shown in FIG. 2 is an implementation in detail of the
main memory bus 21 from which the least significant eight
bits have been input into a prediction controller 20, which is
a 256-bit RAM. Controller 20 interfaces with the branch
logic 22. A determination of branch in progress (BIP) is
made in section 24. If a branch is in progress a guess
prediction is made in unit 26. The least significant 8 bits
from the memory address are used to address the RAM 20.
Branch logic tracks whether a branch was guessed and if a
branch is currently in progress. A significant speed and
hardware enhancement to the implementation of this branch
prediction is the inclusion of the guess in the formation of
the operations code.

Shown in FIG. 3 is a block diagram of the address
processor of the present invention. Control logic 30 contains
an operations code compression section 32 and a branch
RAM logic 34. Address generators 36 output and receive
memory and logical addresses to the computer system. 
Instruction bus 13 is connected to the branch RAM and logic 
unit 34. Instruction execution ROM 40 interfaces with the 
instruction bus and decodes the instructions in decode ROM 
42. Instruction register 44 receives as an input memory data 
through precode RAM 48 from instruction file 46. The 
memory data in register 50 interfaces with the memory data 
in the logic control chip 30. Put-away bus 15 handles data 
and addresses at general purpose register 52 shown inter- 
facing with the control logic 30. 

The microcode for a given instruction is executed by first 
passing the instruction code through a pre-decode RAM 48 
which produces the first microword for all instructions. 
Further microwords for given instructions are produced in 
the instruction microcode ROM 42. The use of microcodes 
is a characteristic of a Complex Instruction Set Computer 
(CISC) architecture. It allows a variety of instructions to be 
decoded with a minimal amount of hardware. While not as 
fast as hardwired solutions, the microcode ROMs have a 
relatively quick decode time. Imbedding the guess bit of 
branch prediction in the microcode address (the compressed 
operation code) for jump operations to be decoded leads to 
a fast/simple decode, including the target address consistent 
with the guess. 

The operational code 60 is manipulated for the instructions
in the opcode compression unit 32. This compressed
opcode allows the guess bits to be imbedded into the opcode
(decode address) without requiring a larger ROM. The
decode ROM allows quick target address generation and
thus, execution within the cycle time. The resulting opcode
compression 62 and branch instructions 64 are shown in the
table of FIG. 4. The 12-bit opcodes for the extended instructions
are reduced to eight bits before entering the I register
44 which addresses the decode ROM 42. It is to be noted that
for the instructions shown, 160 operations are compressed to
62 operation codes. This technique, along with the compression
of input/output operations, allows 384 required instructions
to be decoded from a 256-byte ROM, avoiding the use
of a 512-byte ROM, which would have been needed without
compression. This represents a significant saving in hardware
size and speed.
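
To make the size argument concrete, the following Python sketch remaps 12-bit opcodes onto a smaller set of shared codes and counts the address bits each set needs. The grouping of opcodes, the group labels and both function names are hypothetical; the actual mapping is the one shown in the table of FIG. 4, which is not reproduced here.

```python
def build_compression_table(opcode_to_group):
    """Map each 12-bit opcode to a small code shared by its decode group.

    opcode_to_group is a hypothetical dict {opcode: group_label}; opcodes
    in the same group receive the same compressed code.
    """
    group_codes = {}
    table = {}
    for opcode, group in sorted(opcode_to_group.items()):
        if not 0 <= opcode < (1 << 12):
            raise ValueError("opcode must fit in 12 bits")
        # Assign the next free code the first time a group is seen.
        code = group_codes.setdefault(group, len(group_codes))
        table[opcode] = code
    return table


def bits_needed(count):
    """Number of address bits required to distinguish `count` entries."""
    return max(1, (count - 1).bit_length())
```

With these helpers, 62 codes fit in six bits, an eight-bit address selects one of 256 ROM entries, and a nine-bit address would require 512. That is consistent with the ROM sizes quoted above if each entry is assumed to occupy one byte.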

It can be seen that the guess bit 66 is only relevant to 
conditional branches and that a guess would only affect the
operation code of a conditional branch if the CPU is not 
processing a previous branch, as indicated by the branch in 
progress unit 24. The branch logic 22 combines operation 
code, a prediction signal and a signal which indicates 
another branch is in progress. 
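
The gating described in this paragraph can be sketched as a small Python function. The position of the guess bit (the top bit of an eight-bit decode address) and the seven-bit width left for the compressed opcode are assumptions made for illustration, as are the names; the source does not state where the guess bit sits.

```python
GUESS_BIT = 0x80  # assumed position of the guess bit in the 8-bit decode address


def decode_address(compressed_opcode, is_conditional_branch,
                   predicted_taken, branch_in_progress):
    """Combine the opcode, the prediction signal and the branch-in-progress signal.

    The guess bit is set only for a conditional branch whose prediction is
    "taken" while no earlier branch is still in progress.
    """
    if not 0 <= compressed_opcode < GUESS_BIT:
        raise ValueError("compressed opcode must fit below the guess bit")
    guess = is_conditional_branch and predicted_taken and not branch_in_progress
    return compressed_opcode | (GUESS_BIT if guess else 0)
```

Here the same compressed opcode yields two decode addresses, one per guess, so a decode ROM indexed this way could hold a separate entry for each outcome.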

The branch prediction algorithm disclosed has achieved
an accuracy of approximately 85 percent on the instruction
sets tested. This has led to an overall performance improvement
of approximately seven percent. The additional hardware
is easily justified by this performance improvement.
The hardware was limited to a 256-bit RAM, guess logic and
operations code compression logic. The guess logic and the
compressed opcode are done in microcode. This allows the
task to be handled with good performance in a minimum of
space.

Although a specific embodiment of the present invention 
has been disclosed, it will be understood by those of skill in 
the art that the foregoing and other changes in form and detail can
be made therein without departing from the spirit and scope 
of the invention. 

What is claimed is: 

1. A computing machine including a main memory and 
employing conditional branch prediction comprising:
a prediction random access memory coupled to the main 
memory for receiving a selected number of least sig- 
nificant bits of a previously written memory address; 
branch prediction logic for determining if a branch guess 
is in progress; generating a guess bit, if a branch is in 
progress, and producing a prediction based on the 
previously written memory address; 

operational code compression means for re-mapping all 
processor execution instruction files into compressed 
operational codes and embedding the guess bit into the 
compressed operational codes of lesser size than origi- 
nal operational codes included in the instruction files; 
and 

control logic means for interfacing between the main 
memory and an instruction execution read-only- 
memory to fetch execution instruction files from main 
memory, decode the instruction based on the com- 
pressed operational code including the embedded guess 
bit, and predict a conditional branch based on the 
previous written memory address and the guess bit 
wherein the read-only-memory is reduced in size due to 
the compressed operational codes including the embed- 
ded guess bit and the computing machine is improved 
in performance. 

2. The computing machine of claim 1 wherein the branch 
prediction RAM is written with branch data after each time 
a branch has been resolved. 

3. The computing machine of claim 2 wherein the opera- 
tional code compression means is coupled between the 
branch prediction RAM and an instruction register. 

4. In a data processing system, a memory, a processor, an 
instruction execution unit and a branch prediction mecha- 
nism for handling conditional branch instructions compris- 
ing: 

a) a branch prediction RAM coupled to the memory, the 
branch prediction RAM receiving a selected number of 
least significant digits of a previously written address
from the memory as an address in the RAM; 

b) branch logic coupled to the branch prediction RAM for 
(a) determining if a branch instruction is in progress in 
the instruction unit, and (b) providing a guess bit if a 
branch is in progress; 

c) an operational code compression means coupled to the 
memory for mapping each operation code to form a
new operation code of a lesser number of bits and 
embedding the guess bit in each new operation code; 
and 
d) control means interfacing between the memory and the
instruction execution unit for decoding the compressed 
operational code and predicting a conditional branch 
based upon the guess bit provided a branch instruction 
is not in progress. 

5. The data processing system of claim 3 further including 
a precode RAM coupled to the processor instruction files; 
the precode RAM generating a microcode word as an input 
to an instruction register, and a decode ROM coupled to the 
instruction register for generating further microcode words 
related to the microcode word, the further microcode words 
provided as successive inputs to the instruction execution 
unit. 

6. The system of claim 5 wherein each operation instruc- 
tion code in the system is a 12-bit word and each compressed 
operation code is an 8-bit word whereby the number of 
system instruction operations are reduced from 160 to 62 
through the use of a decode ROM of lesser size than a 
predecode RAM. 

7. A method of branch prediction in a data processing 
system including a memory, an instruction execution unit, 
and control means interfacing the memory and the instruc- 
tion execution unit comprising the steps of: 

a. generating a branch prediction signal in a branch 
prediction RAM using a selected number of least 
significant digits of a previously written address from
the memory as an address for the RAM; 

b. determining if a branch is in progress in the instruction 
execution unit and generating a guess bit if a branch is 
in progress in a branch prediction logic means; 

c. combining the branch prediction signal and guess bit in 
the branch prediction logic; 

d. compressing all processor instruction files in an opera- 
tion code compression means to form new operation 
codes of a lesser number of bits and embedding the 
guess bit in the new operational codes; and 

e. decoding the new operational code and predicting a
conditional branch as an input to the instruction execu- 
tion unit based on the guess bit. 

8. The method of claim 7 further comprising the step of 
forming each processor execution file as a microcode word 
and embedding the guess bit into the word for execution 
provided a branch is not in progress in the system. 

9. The method of claim 8 further including the step of 
decoding the microcode word and guess bit in a decode 
instruction ROM within a cycle time of the system.


Document ID | Title | Current OR

1 | US 6088808 A | Low power consumption semiconductor integrated circuit device and microprocessor | 713/324
2 | US 5848268 A | Data processor with branch target address generating unit | 712/233
3 | US 5835746 A | Method and apparatus for fetching and issuing dual-word or multiple instructions in a data processing system | 712/215
4 | US 5832258 A | Digital signal processor and associated method for conditional data operation with no condition code update | 712/226
5 | US 5829049 A | Simultaneous execution of two memory reference instructions with only [illegible] | 711/168
6 | US 5799163 A | Opportunistic operand forwarding to minimize register file read ports | 712/205
7 | US 5752014 A | Automatic selection of branch prediction methodology for subsequent branch instruction based on outc[illegible] | 712/240
8 | US 5734913 A | Low power consumption semiconductor integrated circuit device and microprocessor | 713/322
9 | US 5713012 A | Microprocessor | 712/233
10 | US 5649145 A | Data processor processing a jump instruction | 711/213
11 | US 5638524 A | Digital signal processor and method for executing DSP and RISC class instructions defining identical data processing or data transfer | 712/221
12 | US 5634118 A | Splitting a floating-point stack-exchange instruction for merging into surrounding instructions by operand translation | 712/226
13 | US 5617550 A | Data processor generating jump target address of a jump instruction in parallel with decoding of the instruction | 712/207
14 | US 5592637 A | Data processor processing a jump instruction | 712/237
15 | US 5590296 A | Data processor processing a jump instruction | 712/229
16 | US 5537561 A | Processor | 712/23
17 | US 5485587 A | Data processor calculating branch target address of a branch instruction in parallel with decoding of the instruction | 712/234
18 | US 5457790 A | Low power consumption semiconductor integrated circuit device and microprocessor | 711/167
19 | US 5398321 A | Microcode generation for a scalable compound instruction set machine | 712/216

20 



21 



US 

48581 
05 A 



US 

44398 
27 A 



Pipelined data processor capable of decoding and executing plural 



~ -instructions.ia.parallei...-. 

Dual fetch microsequencer 



712/235 



712/235 



Page 



1 



(KKiml, 11/14/2000, EAST Version: 1.01 



0015) 



5,826,054 



15 



operation, the compressed operation length being chosen 
from a plurality of finite lengths, which finite lengths include 
at least two non-zero lengths, which of the finite lengths is 
chosen being dependent upon at least one feature of the 
operation. 

2. The medium of claim 1 wherein the compressed
operation length can only be one of 26, 34 and 42 for any
non-null operation.

3. The medium of claim 1 wherein the at least one feature
is at least one of the following:

abbreviated op code;
guarded or unguarded;
resultless;
immediate parameter with fixed number of bits; and
zeroary, unary, or binary.

4. The medium of claim 3 wherein combined operation 
types are aliased according to the following table 



FORMAT | ALIASED TO
zeroary | unary
unary_resultless | unary
binary_resultless_short | binary_resultless
zeroary_param32_short | zeroary_param32
zeroary_param32_resultless_short | zeroary_param32_resultless
zeroary_short | unary
unary_resultless_short | unary
binary_resultless_unguarded | binary_resultless
unary_unguarded | unary
binary_param7_resultless_unguarded | binary_param7_resultless
unary_unguarded | unary
binary_param7_resultless_unguarded | binary_param7_resultless
zeroary_unguarded | unary
unary_resultless_unguarded_short | binary_unguarded_short
unary_unguarded_short | unary_short
zeroary_param32_unguarded_short | zeroary_param32
zeroary_param32_resultless_unguarded_short | zeroary_param32_resultless
zeroary_unguarded_short | unary
unary_resultless_unguarded_short | unary
unary_long | binary
binary_long | binary
binary_resultless_long | binary
unary_param7_long | unary_param7
binary_param7_resultless_long | binary_param7_resultless
zeroary_param32_long | zeroary_param32
zeroary_param32_resultless_long | zeroary_param32_resultless
zeroary_long | binary
unary_resultless_long | binary



5. The medium of claim 3, wherein the fixed number is 
one of 7 and 32. 

6. The medium of claim 1 comprising a plurality of such 
instructions, of which one instruction is a branch target, 
which one instruction is not compressed. 

7. The medium of claim 1 wherein each operation field 
within each instruction includes a sub-field specifying at 
least one of the following: a register file address of a first 
operand; a register file address of a second operand; a 
register file address of guard information; a register file 
address of a result; an immediate parameter; and an op code. 

8. The medium of claim 1 comprising a plurality of such 
instructions, each instruction comprising a format field for 
specifying a plurality of respective formats, one respective 
format for each operation of a succeeding instruction. 

9. The medium of claim 8, wherein the compressed format 
comprises a format field specifying issue slots of the VLIW 
processor to be used by some instruction. 

10. The medium of claim 9 comprising at least one field 
specifying the operation. 
11. The medium of claim 10 wherein the at least one field
specifying the operation comprises at least one byte aligned
sub-field.

12. The medium of claim 10 further comprising at least 
one operation part sub-field located in a same byte with the 
format field. 

13. The medium of claim 12 wherein the format field 
specifies that more than a threshold quantity of issue slots 
are to be used and further comprising at least one first 
operation part sub-field located in a same byte with the 
format field, a plurality of sub-fields specifying operations, 
and at least one second operation part sub-field located in a 
byte separate from the other sub-fields. 

14. The medium of claim 9 wherein the format field has 2*N
bits, where N is the number of issue slots.

15. The medium of claim 9 wherein the instruction takes 
up no more than 32 bytes. 

16. The medium of claim 9 formatted as follows 



<instruction> ::=
    <instruction start>
    <instruction middle>
    <instruction end>
    <instruction extension>
<instruction start> ::=
    <Format:2*N>{<padding:1>}V2{<2-bit operation part:2>}V1{<24-bit operation part:24>}V1
<instruction middle> ::=
    {{<2-bit operation part:2>}4{<24-bit operation part:24>}4}V3
<instruction end> ::=
    {<padding:1>}V5{<2-bit operation part:2>}V4{<24-bit operation part:24>}V4
<instruction extension> ::= {<operation extension:0/8/16>}S
<padding> ::= "0"



Wherein the variables used above are defined as follows:
N = the number of issue slots of the machine, N > 0
S = the number of issue slots used in this instruction
    (0 <= S <= N)
C1 = 4 - (N mod 4)
If (S <= C1) then V1 = S and V2 = 2*(C1 - V1)
If (S > C1) then V1 = C1 and V2 = 0
V3 = (S - V1) div 4
V4 = (S - V1) mod 4
If (V4 > 0) then V5 = 2*(4 - V4) else V5 = 0
Explanation of notation

::= | means | "is defined as"
<field name:number> | means | the field indicated before the colon has the number of bits indicated after the colon
{<field name>}number | means | the field indicated in the angle brackets and braces is repeated the number of times indicated after the braces
"0" | means | the character "0"
"div" | means | integer divide
"mod" | means | modulo
:0/8/16 | means | that the field is 0, 8, or 16 bits long.
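
The variable definitions of claim 16 can be checked with a short calculation. The following Python sketch is only an illustration and is not part of the claims: the names `Layout`, `layout` and `instruction_bits` are hypothetical, and the caller is assumed to supply one extension size (0, 8 or 16 bits) per used issue slot.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layout:
    """Repetition counts V1..V5 used by the instruction grammar."""
    v1: int
    v2: int
    v3: int
    v4: int
    v5: int


def layout(n_slots: int, used: int) -> Layout:
    """Compute V1..V5 for N issue slots of which S are used."""
    if n_slots <= 0 or not 0 <= used <= n_slots:
        raise ValueError("need N > 0 and 0 <= S <= N")
    c1 = 4 - (n_slots % 4)
    if used <= c1:
        v1, v2 = used, 2 * (c1 - used)
    else:
        v1, v2 = c1, 0
    v3 = (used - v1) // 4          # "div" is integer division
    v4 = (used - v1) % 4
    v5 = 2 * (4 - v4) if v4 > 0 else 0
    return Layout(v1, v2, v3, v4, v5)


def instruction_bits(n_slots: int, used: int,
                     extension_bits: Sequence[int]) -> int:
    """Total size in bits of one compressed instruction."""
    if len(extension_bits) != used:
        raise ValueError("one extension size per used issue slot")
    if any(e not in (0, 8, 16) for e in extension_bits):
        raise ValueError("extensions are 0, 8 or 16 bits long")
    lay = layout(n_slots, used)
    part = 2 + 24                   # 2-bit part plus 24-bit part
    start = 2 * n_slots + lay.v2 + part * lay.v1
    middle = lay.v3 * 4 * part
    end = lay.v5 + part * lay.v4
    return start + middle + end + sum(extension_bits)
```

For a five issue slot machine, `instruction_bits(5, 3, [0, 0, 0])` gives 88 bits (11 bytes), and `instruction_bits(5, 5, [16] * 5)` gives 224 bits (28 bytes).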



17. The medium of claim 9 containing an operation which 
is encoded in 26, 34 or 42 bits, wherein 

if the operation is 26 bits, it is one of 
binary unguarded short; 

unary immediate 7-bit parameter unguarded operation; 
binary unguarded immediate 7-bit operand resultless 
short; and 
unary short;

if the operation is 34 bits, it is one of 
binary short; 

unary immediate 7-bit parameter resultless short; 
binary unguarded; 

unary immediate 7-bit parameter unguarded; and 
unary; and 

if the operation is encoded in 42 bits, it is one of 

binary immediate 7-bit parameter resultless; 
binary; 

unary immediate 7-bit parameter; 

zeroary immediate 32-bit parameter; and

zeroary, immediate 32-bit parameter resultless. 

18. The medium of claim 9 wherein the operations are
encoded according to the following table:



name | bits 0-6 | bits 7-13 | bits 14-20 | bits 21-23 | 2-bit part (bits 24-25) | extension (bits 26-33, 34-41) | size

26-format:
<binary-unguarded-short> | src1[0:6] | src2[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | none | 26
<unary-param7-unguarded-short> | src1[0:6] | param[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | none | 26
<binary-unguarded-param7-resultless-short> | src1[0:6] | src2[0:6] | param[0:6] | opcode[0:2] | opcode[3:4] | none | 26
<unary-short> | src1[0:6] | dst[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | none | 26

34-format:
<binary-short> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | dst[0:6], 0 | 34
<unary-param7-short> | src1[0:6] | param[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | dst[0:6], 0 | 34
<binary-param7-resultless-short> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | param[0:6], 0 | 34
<binary-unguarded> | src1[0:6] | src2[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], [fixed bits illegible] | 34
<binary-resultless> | [illegible] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], [fixed bits illegible] | 34
<unary-param7-unguarded> | src1[0:6] | param[0:6] | dst[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], SL111 | 34
<unary> | src1[0:6] | dst[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], [fixed bits illegible] | 34

42-format:
<binary-param7-resultless> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], SXX100, param[0:6] | 42
<binary> | src1[0:6] | src2[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], XL0101, dst[0:6] | 42
<unary-param7> | src1[0:6] | param[0:6] | guard[0:6] | opcode[0:2] | opcode[3:4] | opcode[5:7], SL1101, dst[0:6] | 42
<zeroary-param32> | param[7:13] | param[0:6] | dst[0:6] | param[14:16] | param[17:18] | param[19:23], [fixed bits illegible], param[24:31] | 42
<zeroary-param32-resultless> | param[7:13] | param[0:6] | guard[0:6] | param[14:16] | param[17:18] | param[19:23], [fixed bits illegible], param[24:31] | 42


[EAST search history: 20 numbered queries, labelled between L1 and L21, each with a hit count; the query text is not legible in this scan.]



U.S. Patent    Oct. 20, 1998    Sheet 7 of 15    5,826,054

[Drawing sheet 7 of 15: figure text not recoverable from the scan.]



No. | Document ID | Title | Current OR
1 | US 6128755 A | Fault-tolerant multiple processor system with signature voting | 714/715
2 | US 6128094 A | Printer having processor with instruction cache and compressed program store | 358/1.15
3 | US 6101592 A | Methods and apparatus for scalable instruction set architecture with dynamic compact instructions | 712/20
4 | US 60960[illegible] | Power simulation system, power simulation method and computer-readable recording medium for recording [remainder illegible] | 703/18
5 | US 6044450 A | Processor for VLIW instruction | 712/24
6 | US 5983335 A | Computer system having organization for multiple condition code setting and for testing instruction out-of-order | 712/23
7 | US 5960465 A | Apparatus and method for directly accessing compressed data utilizing a compressed memory address translation unit and compression descriptor | 711/208
8 | US 5938759 A | Processor instruction control mechanism capable of decoding register instructions and immediate instructions [remainder illegible] | 712/209
9 | US 5905893 A | Microprocessor adapted for executing both a non-compressed fixed length instruction set and a compressed variable length instruction set | 717/5
10 | US 5896519 A | Apparatus for detecting instructions from a variable-length compressed instruction set having extended and non-extended instructions | 712/213
11 | US 5892847 A | Method and apparatus for compressing images | 382/232
12 | US 5881308 A | Computer organization for multiple and out-of-order execution of condition code testing and setting instructions out-of-order | 712/23
13 | US 5881260 A | Method and apparatus for sequencing and decoding variable length instructions with an [remainder illegible] | 712/210
14 | US 5867681 A | Microprocessor having register dependent immediate decompression | 712/208
15 | US 5860152 A | Method and apparatus for rapid computation of target addresses for relative control transfer instructions | 711/213
16 | US 5822578 A | System for inserting instructions into processor instruction stream in order to perform interrupt processing | 712/244
17 | US 5819058 A | Instruction compression and decompression system and method for a processor | 712/210
18 | US 5794010 A | Method and apparatus for allowing execution of both compressed instructions and decompressed instructions in a microprocessor | 703/20
19 | US 5784585 A | Computer system for executing instruction stream containing mixed compressed and uncompressed instructions by automatically detecting and [remainder illegible] | 712/209
20 | US 5754746 A | [beginning illegible] compression using parallel encoded streams | 358/1.15



U.S. Patent    Oct. 20, 1998    Sheet 3 of 15    5,826,054

[Drawing sheet 3 of 15: FIG. 2a, FIG. 2b, FIG. 2c, FIG. 2d and FIG. 2e; panels labelled CACHE WORD 1 and CACHE WORD 2.]



No. | Document ID | Title | Current OR
21 | US 5669011 A | Partially decoded instruction cache | 712/23
22 | US 5630157 A | Computer organization for multiple and out-of-order execution of condition code testing and setting instructions | 712/23
23 | US 5630085 A | Microprocessor with improved instruction cycle using time-compressed [remainder illegible] | 712/207
24 | US 5621907 A | Microprocessor with memory storing instructions for time-compressed fetching of instruction data for a second cycle within a first machine cycle | 712/207
25 | US 5544247 A | Transmission and reception of a first and a second main signal component | 381/27
26 | US 5510857 A | Motion estimation coprocessor | 348/699
27 | US 5481751 A | Apparatus and method for storing partially-decoded instructions in the instruction cache of a CPU having [remainder illegible] | 712/213
28 | US 5481643 A | Transmitter, receiver and record carrier for transmitting/receiving at least a first and a second signal component | 704/227
29 | US 5448310 A | Motion estimation coprocessor | 348/699
30 | US 5392126 A | Airborne thermal printer | 358/296
31 | US 5377825 A | Compact disc storage case | 206/232
32 | US 5323618 A | Heat storage type air conditioning apparatus | 62/149
33 | US 5323488 A | Memory access method and circuit in which access timing to a memory is divided into [remainder illegible] | 358/1.16
34 | US 5206660 A | Airborne thermal printer | 347/218
35 | US 5202887 A | Access control method for shared duplex direct access storage device and computer system therefor | 714/8
36 | US 4924431 A | Keyboard located indicia for instructing a multi-mode programmable computer having alphanumeric capabilities from a few keyboard keys | 708/146
37 | US 4873630 A | Scientific processor to support a host processor referencing common memory | 712/3
38 | US 4835679 A | Microprogram control system | 712/212
39 | US 4833599 A | Hierarchical priority branch handling for parallel execution in a parallel processor | 712/236
40 | US 4803620 A | Multi-processor system responsive to pause and pause clearing instructions for instruction execution control | 712/203
41 | US 4799146 A | System for displaying graphic information on video screen employing video display processor | 710/260



U.S. Patent    Oct. 20, 1998    Sheet 4 of 15    5,826,054

[Drawing sheet 4 of 15: figure text not recoverable from the scan.]



No. | Document ID | Title | Current OR
42 | US 4770602 A | Method of capacity controlling of multistage compressor and apparatus therefor | 415/29
43 | US 4768157 A | Video image processing system | 345/517
44 | US 4571634 A | Digital image information compression and decompression method and apparatus | 358/261.3
45 | US 4458110 A | Storage element for speech synthesizer | 704/211
46 | US 4384170 A | Method and apparatus for speech synthesizing | 704/266
47 | US 4380070 A | Automatic circuit identifier | 340/537
48 | US 4309756 A | Method of automatically evaluating source language logic condition sets and of compiling machine [remainder illegible] | 717/9
49 | US 4288816 A | Compressed image producing system | 382/232
50 | US 4214125 A | Method and apparatus for speech synthesizing | 704/268
51 | US 4197578 A | Microprogram controlled data processing system | 712/211
52 | US 4171536 A | Microprocessor system | 711/151
53 | US 4144563 A | Microprocessor system | 712/42
54 | US 4093982 A | Microprocessor system | 712/248
55 | US 4058850 A | [title illegible] | 711/117
56 | US 3936182 A | Control arrangement for an electrostatographic reproduction apparatus | 399/77
57 | US 3656178 A | DATA COMPRESSION AND DECOMPRESSION SYSTEM | 341/87



U.S. Patent    Oct. 20, 1998    Sheet 5 of 15    5,826,054

[Drawing sheet 5 of 15. Each panel shows a row of field names above a row of field sizes in bits, with byte boundaries marked below.]
FIG. 4a: format (10); three padding fields X (2 each); bytes 1-2.
FIG. 4b: format (10); padding X, X and the 2-bit part of operation 1 (2 each); operation 1 (24); bytes 1-5.
FIG. 4c: format (10); padding X and the 2-bit parts of operations 2 and 1 (2 each); operation 1 (24); operation 2 (24); bytes 1-8.
FIG. 4d: format (10); 2-bit parts of operations 3, 2 and 1 (2 each); operation 1 (24); operation 2 (24); operation 3 (24); bytes 1-11; 3 issue slots used.
FIG. 4e: as FIG. 4d, followed by padding X, X, X and the 2-bit part of operation 4 (2 each); operation 4 (24); extension fields; bytes 12-18.
FIG. 4f: as FIG. 4d, followed by padding X, X and the 2-bit parts of operations 5 and 4 (2 each); operation 4 (24); operation 5 (24); extension fields Ext 1 to Ext 5; bytes 12-23.






No. | Document ID | Title | Current OR
1 | US 6108030 A | Appearance inspecting device for solid formulation | 348/91
2 | US 6101592 A | Methods and apparatus for scalable instruction set architecture with dynamic compact instructions | 712/20
3 | US 6044450 A | Processor for VLIW instruction | 712/24
4 | US 5960465 A | Apparatus and method for directly accessing compressed data utilizing a compressed memory address translation unit and compression [remainder illegible] | 711/208
5 | US 5930508 A | Method for storing and decoding instructions for a microprocessor having a plurality of function units | 717/6
6 | US 5907374 A | Method and apparatus for processing a compressed input bitstream | 375/240.2
7 | US 5898462 A | Methods of producing data storage devices for appliances which can be used to coach users in the performance of user-selected tasks | 348/552
8 | US 5801784 A | Data storage devices | 348/552
9 | US 5764304 A | Operation of information/entertainment centers | 348/552
10 | US 5694399 A | Processing unit for generating signals for communication with a test [remainder illegible] | 709/246
11 | US 5646946 A | Apparatus and method for selectively companding data on a slot-by-slot [remainder illegible] | 370/442
12 | US 5632024 A | Microcomputer executing compressed program and generating compressed [remainder illegible] | 712/205
13 | US 5610603 A | [beginning illegible] method used with a static compression dictionary | 341/51
14 | US 5568650 A | Control unit for controlling reading and writing of a magnetic tape unit | 710/52
15 | US 5513301 A | Image compression and decompression apparatus with reduced frame memory | 358/1.15
16 | US 5448301 A | Programmable video transformation rendering method and apparatus | 348/578
17 | US 5361356 A | Storage isolation with subspace-group facility | 711/206
18 | US 5351046 A | Method and system for compacting binary coded decimal data | 341/62
19 | US 5203352 A | Polymeric foam earplug | 128/864
20 | US 5088031 A | Virtual machine file control system which translates block numbers into [remainder illegible] | 709/100



the unary format is used. In that case the field for the
argument is undefined.



TABLE II

OPERATION TYPE | SIZE
<binary-unguarded-short> | 26
<unary-param7-unguarded-short> | 26
<binary-unguarded-param7-resultless-short> | 26
<unary-short> | 26
<binary-short> | 34
<unary-param7-short> | 34
<binary-param7-resultless-short> | 34
<binary-unguarded> | 34
<binary-resultless> | 34
<unary-param7-unguarded> | 34
<unary> | 34
<binary-param7-resultless> | 42
<binary> | 42
<unary-param7> | 42
<zeroary-param32> | 42
<zeroary-param32-resultless> | 42
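
Table II can be read as a lookup from operation type to encoded size. The Python sketch below is illustrative only: the names `OPERATION_SIZE_BITS` and `extension_bits` are hypothetical, and the split of every size into a 26-bit base (24-bit operation part plus 2-bit part) and an extension of 0, 8 or 16 bits is an interpretation of the instruction format, not a quotation of it.

```python
# Encoded size in bits of each operation type listed in Table II.
OPERATION_SIZE_BITS = {
    "binary-unguarded-short": 26,
    "unary-param7-unguarded-short": 26,
    "binary-unguarded-param7-resultless-short": 26,
    "unary-short": 26,
    "binary-short": 34,
    "unary-param7-short": 34,
    "binary-param7-resultless-short": 34,
    "binary-unguarded": 34,
    "binary-resultless": 34,
    "unary-param7-unguarded": 34,
    "unary": 34,
    "binary-param7-resultless": 42,
    "binary": 42,
    "unary-param7": 42,
    "zeroary-param32": 42,
    "zeroary-param32-resultless": 42,
}

BASE_BITS = 24 + 2  # 24-bit operation part plus 2-bit operation part


def extension_bits(operation_type: str) -> int:
    """Bits beyond the 26-bit base that the operation type needs."""
    return OPERATION_SIZE_BITS[operation_type] - BASE_BITS
```

Under these assumptions `extension_bits("binary")` is 16 and `extension_bits("unary-short")` is 0.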



For all operations a 42-bit format is available for use in
branch targets. For unary and binary-resultless operations,
the <binary> format can be used. In that case, unused fields
in the binary format have undefined values. Short 5-bit op
codes are converted to long 8-bit op codes by padding the
most significant bits with 0's. Unguarded operations get, as
a guard address value, the register file address of constant
TRUE. For store operations the 42-bit <binary-param7-
resultless> format is used instead of the regular 34-bit
<binary-param7-resultless-short> format (assuming store
operations belong to the set of short operations).
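
The two conversions described for branch targets (widening a short op code and supplying a guard for an unguarded operation) can be sketched as follows. This is a minimal Python illustration, not the patent's implementation; `TRUE_GUARD_ADDRESS` is a hypothetical placeholder, since the text does not state which register file address holds constant TRUE.

```python
from typing import Optional

# Hypothetical register file address of constant TRUE (assumption).
TRUE_GUARD_ADDRESS = 1


def widen_opcode(short_opcode: int) -> str:
    """Return the 8-bit long op code, as a bit string, for a 5-bit op code.

    The three most significant bits are padded with 0's.
    """
    if not 0 <= short_opcode < 2 ** 5:
        raise ValueError("short op codes are 5 bits wide")
    return format(short_opcode, "08b")


def branch_target_guard(guard: Optional[int]) -> int:
    """Guard address to encode; unguarded operations get constant TRUE."""
    return TRUE_GUARD_ADDRESS if guard is None else guard
```

For example, `widen_opcode(0b10110)` returns `"00010110"`.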

Operation types which do not appear in table II are 
mapped onto those appearing in table II, according to the 
following table of aliases: 



TABLE II 



FORMAT | ALIASED TO
zeroary | unary
unary_resultless | unary
binary_resultless_short | binary_resultless
zeroary_param32_short | zeroary_param32
zeroary_param32_resultless_short | zeroary_param32_resultless
zeroary_short | unary
unary_resultless_short | unary
binary_resultless_unguarded | binary_resultless
unary_unguarded | unary
binary_param7_resultless_unguarded | binary_param7_resultless
unary_unguarded | unary
binary_param7_resultless_unguarded | binary_param7_resultless
zeroary_unguarded | unary
unary_resultless_unguarded_short | binary_unguarded_short
unary_unguarded_short | unary_short
zeroary_param32_unguarded_short | zeroary_param32
zeroary_param32_resultless_unguarded_short | zeroary_param32_resultless
zeroary_unguarded_short | unary
unary_resultless_unguarded_short | unary
unary_long | binary
binary_long | binary
binary_resultless_long | binary
unary_param7_long | unary_param7
binary_param7_resultless_long | binary_param7_resultless
zeroary_param32_long | zeroary_param32
zeroary_param32_resultless_long | zeroary_param32_resultless
zeroary_long | binary
unary_resultless_long | binary
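
Alias resolution followed by a size lookup can be sketched in Python as below. Only a handful of rows from the alias table and from Table II are copied in, so the dictionaries are excerpts, and the names `ALIASES`, `SIZE_BITS`, `canonical_format` and `encoded_size` are hypothetical.

```python
# Excerpt of the alias table: operation type -> format it is encoded in.
ALIASES = {
    "zeroary": "unary",
    "unary_long": "binary",
    "binary_resultless_short": "binary_resultless",
    "unary_unguarded_short": "unary_short",
    "zeroary_param32_long": "zeroary_param32",
}

# Excerpt of Table II: format -> encoded size in bits.
SIZE_BITS = {
    "unary_short": 26,
    "unary": 34,
    "binary_resultless": 34,
    "binary": 42,
    "zeroary_param32": 42,
}


def canonical_format(operation_type: str) -> str:
    """Map an operation type onto the format that actually encodes it."""
    return ALIASES.get(operation_type, operation_type)


def encoded_size(operation_type: str) -> int:
    """Size in bits after alias resolution."""
    return SIZE_BITS[canonical_format(operation_type)]
```

With these excerpts, a `zeroary` operation is encoded as `unary` in 34 bits and a `unary_long` operation as `binary` in 42 bits.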



The following is a table of fields which appear in operations:

TABLE III

FIELD | SIZE | MEANING
src1 | 7 | register file address of first operand
src2 | 7 | register file address of second operand
guard | 7 | register file address of guard
dst | 7 | register file address of result
param | 7/32 | 7 bit parameter or 32 bit immediate value
op code | 5/8 | 5 bit short op code or 8 bit long op code




FIG. 5 includes a complete specification of the encoding
of operations.

7. Extensions of the instruction format

Within the instruction format there is some flexibility to
add new operations and operation forms, as long as encoding
within a maximum size of 42 bits is possible.

The format is based on 7-bit register file addresses. For
register file addresses of different sizes, redesign of the
format and decompression hardware is necessary.

The format can be used on machines with varying numbers
of issue slots. However, the maximum size of the
instruction is constrained by the word size in the instruction
cache. In a 4 issue slot machine the maximum instruction
size is 22 bytes (176 bits) using four 42-bit operations plus
8 format bits. In a five issue slot machine, the maximum
instruction size is 28 bytes (224 bits) using five 42-bit
operations plus 10 format bits.

In a six issue slot machine, the maximum instruction size
would be 264 bits, using six 42-bit operations plus 12 format
bits. If the word size is limited to 256 bits, and six issue slots
are desired, the scheduler can be constrained to use at most
5 operations of the 42 bit format in one instruction. The fixed
format for branch targets would have to use 5 issue slots of
42 bits and one issue slot of 34 bits.
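
The size figures above follow from simple arithmetic: each issue slot contributes one operation plus 2 format bits, and the total is rounded up to whole bytes. A minimal Python sketch, with hypothetical function names, reproduces them:

```python
def max_instruction_bits(n_slots: int, op_bits: int = 42) -> int:
    """Largest instruction: every slot holds an op_bits-wide operation,
    plus 2 format bits per issue slot."""
    return n_slots * op_bits + 2 * n_slots


def bits_to_bytes(bits: int) -> int:
    """Round a bit count up to whole bytes."""
    return (bits + 7) // 8


# Constrained six-slot case from the text: five 42-bit operations,
# one 34-bit operation, and 12 format bits.
constrained_six_slot_bits = 5 * 42 + 1 * 34 + 2 * 6
```

Here `max_instruction_bits(5)` is 220 bits, which `bits_to_bytes` rounds up to the 28 bytes (224 bits) quoted above, and `constrained_six_slot_bits` is exactly 256.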

COMPRESSING THE INSTRUCTIONS

FIG. 8 shows a diagram of how source code becomes a
loadable, compressed object module. First the source code
801 must be compiled by compiler 802 to create a first set
of object modules 803. These modules are linked by linker
804 to create a second type of object module 805. This
module is then compressed and shuffled at 806 to yield
loadable module 807. Any standard compiler or linker can
be used. Appendix D gives some background information
about the format of object modules in the environment of the
invention. Object modules II contain a number of standard
data structures. These include: a header; global & local
symbol tables; reference table for relocation information; a
section table; and debug information, some of which are
used by the compression and shuffling module 806. The
object module II also has partitions, including a text
partition, where the instructions to be processed reside, and
a source partition which keeps track of which source files the
text came from.

A high level flow chart of the compression and shuffling
module is shown at FIG. 9. At 901, object module II is read
in. At 902 the text partition is processed. At 903 the other
sections are processed. At 904 the header is updated. At 905,
the object module is output.

FIG. 10 expands box 902. At 1001, the reference table, i.e.
relocation information, is gathered. At 1002, the branch
targets are collected, because these are not to be compressed.
At 1003, the software checks to see if there are more files in
the source partition. If so, at 1004, the portion corresponding
to the next file is retrieved. Then, at 1005, that portion is
compressed. At 1006, file information in the source partition
is updated. At 1007, the local symbol table is updated.

Once there are no more files in the source partition, the
global symbol table is updated at 1008. Then, at 1009,
address references in the text section are updated. Then at
1010, 256-bit shuffling is effected. Motivation for such
shuffling will be discussed below.

FIG. 11 expands box 1005. First, it is determined at 1101
whether there are more instructions to be compressed. If so,
a next instruction is retrieved at 1102. Subsequently each
operation in the instruction is compressed at 1103 as per the
tables in FIGS. 5a and 5b and a scatter table is updated at
1108. The scatter table is a new data structure, required as a
result of compression and shuffling, which will be explained
further below. Then, at 1104, all of the operations in an
instruction and the format bits of a subsequent instruction
are combined as per FIGS. 4a-4e. Subsequently the relocation
information in the reference table must be updated at
1105, if the current instruction contains an address. At 1106,
information needed to update address references in the text
section is gathered. At 1107, the compressed instruction is
appended at the end of the output bit string and control is
returned to box 1101. When there are no more instructions,
control returns to box 1006.

Appendices B and C are source code appendices, in which
the functions of the various modules are as listed below:

TABLE IV

Name of module        identification of function performed

scheme_table          readable version of table of FIGS. 5a and 5b
comp_shuffle.c        256-bit shuffle, see box 1010
comp_scheme.c         boxes 1103-1104
comp_bitstring.c      boxes 1005 & 1009
comp_main.c           controls main flow of FIGS. 9 and 10
comp_src.c,           miscellaneous support routines for
comp_reference.c,     performing other functions listed in
comp_misc.c,          FIG. 11
comp_btarget.c

The scatter table, which is required as a result of the
compression and shuffling of the invention, can be explained
as follows.

The reference table contains a list of locations of
addresses used by the instruction stream and a corresponding
list of the actual addresses listed at those locations. When the
code is compressed, and when it is loaded, those addresses
must be updated. Accordingly, the reference table is used at
these times to allow the updating.

However, when the code is compressed and shuffled, the
actual bits of the addresses are separated from each other and
reordered. Therefore, the scatter table lists, for each address
in the reference table, where EACH BIT is located. In the
preferred embodiment the table lists a width of a bit field,
an offset from the corresponding index of the address in the
source text, and a corresponding offset from the corresponding
index in the address in the destination text.

When object module III is loaded to run on the processor,
the scatter table allows the addresses listed in the reference
table to be updated even before the bits are deshuffled.

DECOMPRESSING THE INSTRUCTIONS

In order for the VLIW processor to process the instructions
compressed as described above, the instructions must
be decompressed. After decompression, the instructions will
fill the instruction register, which has N issue slots, N being
5 in the case of the preferred embodiment. FIG. 12 is a
schematic of the decompression process. Instructions come
from memory 1201, i.e. either from the main memory 104
or the instruction cache 105. The instructions must then be
deshuffled 1201, which will be explained further below,
before being decompressed 1203. After decompression
1203, the instructions can proceed to the CPU 1204.

Each decompressed operation has 2 format bits plus a 42
bit operation. The 2 format bits indicate one of the four
possible operation lengths (unused issue slot, 26-bit, 34-bit,
or 42-bit). These format bits have the same values as
"Format" in section 5 above. If an operation has a size of 26
or 34 bits, the upper 16 or 8 bits, respectively, are undefined.
If an issue slot is unused, as indicated by the format bits, then all
operation bits are undefined and the CPU has to replace the
op code by a NOP op code (or otherwise indicate NOP to
functional units).

Formally the decompressed instruction format is

<decompressed instruction> ::= {<decompressed operation>}N
<decompressed operation> ::= <operation:42><format:2>

Operations have the format as in Table III 
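
The handling of the undefined bits in a decompressed issue slot can be sketched as follows. This is only an illustration: the name normalize_slot and the NOP value are hypothetical, and because the format bit values of section 5 are not reproduced here, the sketch takes the operation length directly instead of decoding the 2 format bits. It also assumes that the defined bits of a short operation are the low-order bits of the 42-bit field.

```python
NOP = 0  # hypothetical NOP encoding, chosen only for this sketch


def normalize_slot(operation, length):
    """Return the defined bits of one decompressed 42-bit operation field.

    length is the operation length signalled by the 2 format bits:
    0 (unused issue slot), 26, 34 or 42.
    """
    if length == 0:
        return NOP  # all operation bits are undefined, so substitute a NOP
    if length not in (26, 34, 42):
        raise ValueError("unsupported operation length")
    # Keep the low `length` bits and drop the undefined upper bits.
    return operation & ((1 << length) - 1)


field = (1 << 42) - 1  # a 42-bit field with every bit set
print(hex(normalize_slot(field, 26)))
print(hex(normalize_slot(field, 34)))
print(normalize_slot(field, 0))
```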

Appendix A is VERILOG code which specifies the func- 
tioning of the decompression unit. VERILOG code is a 
standard format used as input to the VERILOG simulator 
produced by Cadence Design Systems, Inc. of San Jose, 
Calif. The code can also be input directly to the design 
compiler made by Synopsys of Mountain View, Calif., to
create circuit diagrams of a decompression unit which will 
decompress the code. The VERILOG code specifies a list of 
pins of the decompression unit. These pins are listed in 
TABLE V below: 



TABLE V

# of pins   name of group
in group    of pins         description of group of pins

512         data512         512 bit input data word from memory, i.e.
                            either from the instruction cache or the
                            main memory
32          PC              input program counter
44          operation4      output contents of issue slot 4
44          operation3      output contents of issue slot 3
44          operation2      output contents of issue slot 2
44          operation1      output contents of issue slot 1
44          operation0      output contents of issue slot 0
10          format_out      output duplicate of format bits in operations
32          first_word      output first 32 bits pointed to by program
                            counter
1           format_ctrl0    is it a branch target or not?
1, each     reissue1,       input global pipeline control signals
            stall_in,
            freeze,
            reset,
            clk

Data512 is a double word which contains an instruction 
which is currently of interest. In the above, the program 
counter, PC is used to determine data512 according to the 
following algorithm: 

COMPRESSED INSTRUCTION FORMAT
FOR USE IN A VLIW PROCESSOR 

REFERENCE TO MICROFICHE APPENDIX 

This application has a microfiche appendix having 2 
sheets of fiches and 66 frames. 

BACKGROUND OF THE INVENTION 

a. Field of the Invention 

The invention relates to VLIW (Very Long Instruction 
Word) processors and in particular to instruction formats for 
such processors and apparatus for processing such instruc- 
tion formats. 

b. Background of the Invention 

VLIW processors have instruction words including a 
plurality of issue slots. The processors also include a plu- 
rality of functional units. Each functional unit is for execut- 
ing a set of operations of a given type. Each functional unit 
is RISC-like in that it can begin an instruction in each
machine cycle in a pipe-lined manner. Each issue slot is for 
holding a respective operation. All of the operations in a 
same instruction word are to be begun in parallel on the 
functional unit in a single cycle of the processor. Thus the 
VLIW implements fine-grained parallelism.

Thus, typically an instruction on a VLIW machine 
includes a plurality of operations. On conventional 
machines, each operation might be referred to as a separate 
instruction. However, in the VLIW machine, each instruc- 
tion is composed of operations or no-ops (dummy 
operations). 

Like conventional processors, VLIW processors use a 
memory device, such as a disk drive to store instruction 
streams for execution on the processor. A VLIW processor 
can also use caches, like conventional processors, to store 
pieces of the instruction streams with high bandwidth acces- 
sibility to the processor. 

The instruction in the VLIW machine is built up by a 
programmer or compiler out of these operations. Thus the 
scheduling in the VLIW processor is software-controlled. 

The VLIW processor can be compared with other types of 
parallel processors such as vector processors and superscalar 
processors as follows. Vector processors have single opera- 
tions which are performed on multiple data items simulta- 
neously. Superscalar processors implement fine-grained 
parallelism, like the VLIW processors, but unlike the VLIW
processor, the superscalar processor schedules operations in 
hardware. 

Because of the long instruction words, the VLIW proces- 
sor has aggravated problems with cache use. In particular, 
large code size causes cache misses, i.e. situations where 
needed instructions are not in cache. Large code size also
requires a higher main memory bandwidth to transfer code 
from the main memory to the cache. Large code size can be 
aggravated by the following factors. 

In order to fine tune programs for optimal running, 
techniques such as grafting, loop unrolling, and proce- 
dure inlining are used. These procedures increase code 
size. 

Not all issue slots are used in each instruction. A good 
optimizing compiler can reduce the number of unused 
issue slots; however a certain number of no-ops 
(dummy instructions) will continue to be present in 
most instruction streams. 

In order to optimize use of the functional units, operations
on conditional branches are typically begun prior to
expiration of the branch delay, i.e. before it is known
which branch is going to be taken. To resolve which
results are actually to be used, guard bits are included
with the instructions.

Larger register files, preferably used on newer processor 
types, require longer addresses, which have to be 
included with operations. 
A scheme for compression of VLIW instructions has been 
proposed in U.S. Pat. Nos 5,179,680 and 5,057,837. This 
compression scheme eliminates unused operations in an 
instruction word using a mask word, but there is more room 
to compress the instruction. 

SUMMARY OF THE INVENTION 

It is an object of the invention to reduce code size in a 
VLIW processor. 

This object is met by using a compression scheme in 
which, within an instruction having a plurality of operations, 
each operation is compressed. Compression includes assign- 
ing a compressed operation length to the operation. The 
compression includes choosing one of a plurality of finite 
lengths. The finite lengths include at least one non-zero 
length. Which length is chosen depends on a feature of the 
operation. 

Branch targets are not compressed. For each instruction,
information about compression format is stored in a previous
instruction.

FURTHER INFORMATION ABOUT 
TECHNICAL BACKGROUND TO THIS 
APPLICATION 

The following prior applications are incorporated herein 
by reference: 

U.S. patent application Ser. No. 07/998,080, filed Dec. 29,
1992 (PHA 21,777), now abandoned, which shows a
VLIW processor architecture for implementing
fine-grained parallelism;

U.S. patent application Ser. No. 07/142,648 filed Oct. 25,
1993 (PHA 1205), now U.S. Pat. No. 5,450,556, which
shows use of guard bits; and

U.S. patent application Ser. No. 07/366,958 filed Dec. 30,
1994 (PHA 21,932) which shows a register file for use
with VLIW architecture.

Bibliography of program compression techniques: 

J. Wang et al, "The Feasibility of Using Compression to 
Increase Memory System Performance", Proc. 2nd Int. 
Workshop on Modeling Analysis, and Simulation of 
Computer and Telecommunications Systems, p. 
107-113 (Durham, N.C., USA 1994); 

H. Schroder et al., "Program compression on the instruc- 
tion systolic array", Parallel Computing, vol. 17, n 2-3, 
June 1991, p. 207-219; 

A. Wolfe et al., "Executing Compressed Programs on an 
Embedded RISC Architecture", J. Computer and Soft- 
ware Engineering, vol. 2, no. 3, pp. 315-27, (1994); 

M. Kozuch et al., "Compression of Embedded Systems 
Programs", Proc. 1994 IEEE Int. Conf. on Computer 
Design: VLSI in Computers and Processors (Oct. 
10-12, 1994, Cambridge Mass., USA) pp. 270-7. 
Typically the approach adopted in these documents has been 
to attempt to compress a program as a whole or blocks of 
program code. Moreover, typically some table of instruction 
locations or locations of blocks of instructions is necessi- 
tated by these approaches. 

BRIEF DESCRIPTION OF THE DRAWING

The invention will now be described by way of non- 
limitative example with reference to the following figures: 

FIG. 1a shows a processor for using the compressed
instruction format of the invention.

FIG. 1b shows more detail of the CPU of the processor of
FIG. 1a.

FIGS. 2a-2e show possible positions of instructions in
cache.

FIG. 3 illustrates a part of the compression scheme in 
accordance with the invention. 

FIGS. 4a-4f illustrate examples of compressed instructions
in accordance with the invention.

FIGS. 5a-5b give a table of compressed instruction
formats according to the invention.

FIG. 6a is a schematic showing the functioning of instruc- 
tion cache 103 on input. 

FIG. 6b is a schematic showing the functioning of a
portion of the instruction cache 103 on output. 

FIG. 7 is a schematic showing the functioning of instruc- 
tion cache 104 on output. 

FIG. 8 illustrates compilation and linking of code accord- 
ing to the invention. 

FIG. 9 is a flow chart of compression and shuffling 
modules. 

FIG. 10 expands box 902 of FIG. 9. 
FIG. 11 expands box 1005 of FIG. 10. 
FIG. 12 illustrates the decompression process. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

FIG. 1a shows the general structure of a processor according
to the invention. A microprocessor according to the
invention includes a CPU 102, an instruction cache 103, and
a data cache 105. The CPU is connected to the caches by
high bandwidth buses. The microprocessor also contains a
memory 104 where an instruction stream is stored.

The cache 103 is structured to have 512 bit double words.
The individual bytes in the words are addressable, but the
bits are not. Bytes are 8 bits long. Preferably the double
words are accessible as a single word in a single clock cycle.

The instruction stream is stored as instructions in a 
compressed format in accordance with the invention. The 
compressed format is used both in the memory 104 and in 
the cache 103. 

FIG. 1b shows more detail of the VLIW processor according
to the invention. The processor includes a multiport
register file 150, a number of functional units 151, 152,
153, . . . , and an instruction issue register 154. The multiport
register file stores results from and operands for the func- 
tional units. The instruction issue register includes a plural- 
ity of issue slots for containing operations to be commenced 
in a single clock cycle, in parallel, on the functional units 
151, 152, 153, .... A decompression unit 155, explained 
more fully below, converts the compressed instructions from 
the instruction cache 103 into a form usable by the IIR 154. 

COMPRESSED INSTRUCTION FORMAT 
1. General Characteristics 

The preferred embodiment of the claimed instruction 
format is optimized for use in a VLIW machine having an
instruction word which contains 5 issue slots. The format
has the following characteristics:

unaligned, variable length instructions;
variable number of operations per instruction;
3 possible sizes of operations: 26, 34 or 42 bits (also
called a 26/34/42 format);
the 32 most frequently used operations are encoded more
compactly than the other operations;
operations can be guarded or unguarded;
operations are one of zeroary, unary, or binary, i.e. they
have 0, 1 or 2 operands;
operations can be resultless;
operations can contain immediate parameters having 7 or
32 bits;
branch targets are not compressed; and
format bits for an instruction are located in the prior
instruction.
2. Instruction Alignment 

Except for branch targets, instructions are stored aligned 
on byte boundaries in cache and main memory. Instructions 
are unaligned with respect to word or block boundaries in 
either cache or main memory. Unaligned instruction cache 
access is therefore needed. 

In order to retrieve unaligned instructions, the processor
retrieves one word per clock cycle from the cache. 

As will be seen from the compression format described 
below, branch targets need to be uncompressed and must fall 
within a single word of the cache, so that they can be 
retrieved in a single clock cycle. Branch targets are aligned 
by the compiler or programmer according to the following 
rule: 

if a word boundary falls within the branch target or 
exactly at the end of the branch target, padding is added 
to make the branch target start at the next word bound- 
ary 

Because the preferred cache retrieves double words in a 
single clock cycle, the rule above can be modified to 
substitute double word boundaries for word boundaries.
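
The alignment rule for branch targets can be sketched as follows. The function name place_branch_target and its parameters are hypothetical; offsets and lengths are in bytes, and the default word size of 64 bytes assumes the 512 bit double word of the preferred cache. The sketch also assumes that a branch target is always shorter than one word.

```python
def place_branch_target(start, length, word_bytes=64):
    """Apply the branch target alignment rule.

    Returns (new_start, padding). If a word boundary falls within the
    branch target or exactly at its end, the target is moved to the next
    word boundary and the gap is filled with padding bytes.
    """
    if not 0 < length < word_bytes:
        raise ValueError("sketch assumes a target shorter than one word")
    next_boundary = (start // word_bytes + 1) * word_bytes
    if start + length >= next_boundary:
        return next_boundary, next_boundary - start
    return start, 0


# A 28 byte branch target at various offsets, using 32 byte words.
for offset in (0, 4, 10, 32):
    print(offset, place_branch_target(offset, 28, word_bytes=32))
```

A target that already starts on a word boundary is left where it is, since the boundary at its first byte does not fall within it.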

The normal unaligned instructions are retrieved so that 
succeeding instructions are assembled from the tail portion 
of the current word and an initial portion of the succeeding 
word. Similarly, all subsequent instructions may be 
assembled from 2 cache words, retrieving an additional 
word in each clock cycle. 

This means that whenever code segments are relocated 
(for instance in the linker or in the loader) alignment must 
be maintained. This can be achieved by relocating base 
addresses of the code segments to multiples of the cache 
block size. 

FIGS. 2a-e show unaligned instruction storage in cache
in accordance with the invention.

FIG. 2a shows two cache words with three instructions i1,
i2, and i3 in accordance with the invention. The instructions
are unaligned with respect to word boundaries. Instructions
i1 and i2 can be branch targets, because they fall entirely
within a cache word. Instruction i3 crosses a word boundary
and therefore must not be a branch target. For the purposes
of these examples, however, it will be assumed that i1 and
only i1 is a branch target.

FIG. 2b shows an impermissible situation. Branch target
i1 crosses a word boundary. Accordingly, the compiler or
programmer must shift the instruction i1 to a word boundary
and fill the open area with padding bytes, as shown in FIG.
2c.

FIG. 2d shows another impermissible situation. Branch
target instruction i1 ends precisely at a word boundary. In
this situation, again it must be moved over to a word
boundary and the open area filled with padding as shown in
FIG. 2e.

Branch targets must be instructions, rather than operations
within instructions. The instruction compression techniques 
described below generally eliminate no-ops (dummy 
instructions). However, because the branch target instruc- 
tions are uncompressed, they must contain no-ops to fill the 
issue slots which are not to be used by the processor. 
3. Bit and Byte order 

Throughout this application bit and byte order are little
endian. Bits and bytes are listed with the least significant bits
first, as below:

Bit number      0 ... 8 ... 16 ...
Byte number     0     1     2
address         0     1     2

4. Instruction format 

The compressed instruction can have up to seven types of 
fields. These are listed below. The format bits are the only 
mandatory field. 

The instructions are composed of byte aligned sections.
The first two bytes contain the format bits and the first group 
of 2-bit operation parts. All of the other fields are integral 
multiples of a byte, except for the second 2-bit operation 
parts which contain padding bits. 

The operations, as explained above can have 26, 34, or 42 
bits. 26-bit operations are broken up into a 2-bit part to be 
stored with the format bits and a 24-bit part. 34-bit opera- 
tions are broken up into a 2 bit part, a 24-bit part, and a one 
byte extension. 42-bit operations are broken up into a 2 bit 
part, a 24 bit part, and a two byte extension. 
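
The split of an operation into its parts can be sketched as follows. The text does not state which bits of an operation go into which part, so the assignment used here (lowest 2 bits to the 2-bit part, next 24 bits to the 24-bit part, remaining bits to the extension) is purely an assumption for illustration, as are the names split_operation and join_operation.

```python
def split_operation(value, size):
    """Split an operation of 26, 34 or 42 bits into its stored parts.

    Returns (two_bit_part, part24, extension_bytes). The bit assignment
    is an assumption of this sketch, not taken from the text.
    """
    if size not in (26, 34, 42):
        raise ValueError("operation size must be 26, 34 or 42 bits")
    if value < 0 or value >> size:
        raise ValueError("value does not fit in the given size")
    two_bit_part = value & 0b11
    part24 = (value >> 2) & 0xFFFFFF
    # 0, 1 or 2 extension bytes for 26, 34 or 42 bit operations.
    extension = (value >> 26).to_bytes((size - 26) // 8, "little")
    return two_bit_part, part24, extension


def join_operation(two_bit_part, part24, extension):
    """Inverse of split_operation under the same assumed bit assignment."""
    high = int.from_bytes(extension, "little")
    return two_bit_part | (part24 << 2) | (high << 26)


parts = split_operation((1 << 41) | 0x155, 42)
print(parts, len(parts[2]))
```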

A. Format bits 

These are described in section 5 below. With a 5 issue slot 
machine, 10 format bits are needed. Thus, one byte plus two 
bits are used. 

B. 2-bit operation parts, first group 

While most of each operation is stored in the 24-bit part 
explained below, i.e. 3 bytes, with the preferred instruction 
set 24 bits was not adequate. The shortest operations 
required 26 bits. Accordingly, it was found that the six bits 
left over in the bytes for the format bit field could advan- 
tageously be used to store extra bits from the operations, two 
bits for each of three operations. If the six bits designated for 
the 2-bit parts are not needed, they can be filled with padding 
bits. 

C. 24-bit operation parts, first group

There will be as many 24 bit operation parts as there were 
2 bit operation parts in the two bit operation parts, first 
group. In other words, up to three 3 byte operation parts can 
be stored here. 

D. 2 bit operation parts, second group 

In machines with more than 3 issue slots a second group
of 2-bit and 24-bit operation parts is necessary. The second
group of 2-bit parts consists of a byte with 4 sets of 2-bit
parts. If any issue slot is unused, its bit positions are filled
with padding bits. Padding bits sit on the left side of the byte.
In a five issue slot machine, with all slots used, this section
would contain 4 padding bits followed by two groups of
2-bit parts. The five issue slots are spread out over the two
groups: 3 issue slots in the first group and 2 issue slots in the
second group.

E. 24-bit operation parts, second group 

The group of 2-bit parts is followed by a corresponding 
group of 24 bit operation parts. In a five issue slot machine 
with all slots used, there would be two 24-bit parts in this 
group. 

F. further groups of 2-bit and 24-bit parts 

In a very wide machine, i.e. more than 6 issue slots,
further groups of 2-bit and 24-bit operation parts are
necessary.

G. Operation extension 

At the end of the instruction there is a byte-aligned group
of optional 8 or 16 bit operation extensions, each of them 
byte aligned. The extensions are used to extend the size of 
the operations from the basic 26 bit to 34 or 42 bit, if needed. 



<instruction> ::=
    <instruction start>
    <instruction middle>
    <instruction end>
    <instruction extension>
<instruction start> ::=
    <Format:2*N>{<padding:1>}V2{<2-bit operation part:2>}V1{<24-bit operation part:24>}V1
<instruction middle> ::=
    {{<2-bit operation part:2>}4{<24-bit operation part:24>}4}V3
<instruction end> ::=
    {<padding:1>}V5{<2-bit operation part:2>}V4{<24-bit operation part:24>}V4
<instruction extension> ::= {<operation extension:0/8/16>}S
<padding> ::= "0"

Wherein the variables used above are defined as follows:
N=the number of issue slots of the machine, N>1
S=the number of issue slots used in this instruction (0≤S≤N)
C1=4-(N mod 4)
If (S≤C1) then V1=S and V2=2*(C1-V1)
If (S>C1) then V1=C1 and V2=0
V3=(S-V1) div 4
V4=(S-V1) mod 4
If (V4>0) then V5=2*(4-V4) else V5=0
Explanation of notation

::=                     means "is defined as"
<field name:number>     means the field indicated before the colon has the
                        number of bits indicated after the colon.
{<field name>}number    means the field indicated in the angle brackets and
                        braces is repeated the number of times
                        indicated after the braces
"0"                     means the bit "0".
"div"                   means integer divide
"mod"                   means modulo
:0/8/16                 means that the field is 0, 8, or 16 bits long
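
The variables V1 to V5 and the resulting instruction length can be worked through with a short sketch. The function names layout and compressed_bits are hypothetical; the formulas follow the definitions given for N, S, C1 and V1 to V5, and each operation extension is taken to be the operation size minus 26 bits.

```python
def layout(n_slots, used):
    """Compute (V1, V2, V3, V4, V5) for N issue slots and S used slots."""
    c1 = 4 - (n_slots % 4)
    if used <= c1:
        v1, v2 = used, 2 * (c1 - used)
    else:
        v1, v2 = c1, 0
    v3 = (used - v1) // 4
    v4 = (used - v1) % 4
    v5 = 2 * (4 - v4) if v4 > 0 else 0
    return v1, v2, v3, v4, v5


def compressed_bits(n_slots, op_sizes):
    """Length in bits of an instruction holding the given operations."""
    v1, v2, v3, v4, v5 = layout(n_slots, len(op_sizes))
    start = 2 * n_slots + v2 + v1 * (2 + 24)
    middle = v3 * 4 * (2 + 24)
    end = v5 + v4 * (2 + 24)
    extension = sum(size - 26 for size in op_sizes)
    return start + middle + end + extension


print(layout(5, 5))
for ops in ([], [26], [26] * 3, [42] * 5):
    print(ops, compressed_bits(5, ops) // 8, "bytes")
```

For a five issue slot machine this gives 2 bytes for an instruction with no operations and 5 bytes for an instruction with one 26-bit operation, and 28 bytes when all five slots hold 42-bit operations.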



Examples of compressed instructions are shown in FIGS. 
4a-f. 

FIG. 4a shows an instruction with no operations. The 
instruction contains two bytes, including 10 bits for the 
format field and 6 bits which contain only padding. The 
former is present in all the instructions. The latter normally 
correspond to the 2-bit operation parts. The X's at the top of 
the bit field indicate that the fields contain padding. In the 
later figures, an O is used to indicate that the fields are used. 

FIG. 4b shows an instruction with one 26-bit operation. 
The operation includes one 24 bit part at bytes 3-5 and one 
2 bit part in byte 2. The 2 bits which are used are marked 
with an O at the top. 

FIG. 4c shows an instruction with two 26-bit operations. 
The first 26-bit operation has its 24-bit part in bytes 3-5 and 
its extra two bits in the last of the 2-bit part fields. The
second 26-bit operation has its 24-bit part in bytes 6-8 and 
its extra two bits in the second to last of the 2-bit part fields. 

FIG. 4d shows an instruction with three 26-bit operations.
The 24-bit parts are located in bytes 3-11 and the 2-bit parts 
are located in byte 2 in reversed order from the 24-bit parts. 

FIG. 4e shows an instruction with four operations. The
second operation has a 2 byte extension. The fourth opera- 



